In silico network topology-based prediction of gene essentiality 
The identification of genes essential for survival is important for the understanding of the minimal
requirements for cellular life and for drug design. As experimental studies with the purpose of
building a catalog of essential genes for a given organism are time-consuming and laborious, a
computational approach which could predict gene essentiality with high accuracy would be of great
value. We present here a novel computational approach, called NTPGE (Network Topology-based
Prediction of Gene Essentiality), that relies on network topology features of a gene to estimate its
essentiality. The first step of NTPGE is to construct the integrated molecular network for a given
organism comprising protein physical, metabolic and transcriptional regulation interactions. The
second step consists in training a decision tree-based machine learning algorithm on known essential
and non-essential genes of the organism of interest, considering as learning attributes the network
topology information for each of these genes. Finally, the decision tree classifier generated is applied
to the set of genes of this organism to estimate essentiality for each gene. We applied the NTPGE
approach for discovering essential genes in Escherichia coli and then assessed its performance.

I. INTRODUCTION

Essential genes are genes that are indispensable to support cellular life. These genes constitute a minimal set of
genes required for a living cell. Therefore, the functions encoded by this gene set are essential and could be considered
as a foundation of life itself [1, 2]. The identification of genes which are essential for survival is important not only
for the understanding of the minimal requirements for cellular life, but also for practical purposes. For example, since
most antibiotics target essential cellular processes, essential gene products of microbial cells are promising new targets
for such drugs [3]. The prediction and discovery of essential genes has been performed by experimental procedures
such as single gene knockouts [4], RNA interference [5] and conditional knockouts [6], but each of these techniques
requires a large investment of time and resources, and they are not always feasible.

Considering these experimental constraints, a computational or in silico approach capable of accurately predicting
gene essentiality would be of great value. Some such predictors have already been developed, in which sequence
features of genes and proteins, with or without homology comparison, are used as parameters for training
machine learning classifiers for gene essentiality prediction [7, 8]. In addition, predictors of gene essentiality based on
network topology features, such as the physical interactions of a protein [9] or the number of biochemical species that are
knocked out from the metabolic network following a gene deletion [10], [11], have also been developed.

The currently available network topology-based methodologies of gene essentiality prediction use only one type of
network topology feature, i.e. either protein physical interactions or metabolic interactions, for performing such predictions.
Actual molecular interaction networks, however, are composed of entities that are intricately connected by diverse
types of interactions, such as protein physical, metabolic and transcriptional regulation interactions.

We therefore propose here a novel machine learning-based in silico approach, called NTPGE (Network Topology-based
Prediction of Gene Essentiality), that relies on multiple topological network features of a given gene to estimate
its essentiality. For the generation of the decision tree classifier, NTPGE employs the following network topological
features as learning attributes: number of physical interactions for the corresponding encoded protein, number of
target genes transcriptionally regulated by the corresponding encoded transcription factor, number of transcription
factors that regulate it, number of enzymes that use metabolites produced by the corresponding encoded enzyme as
reactants and number of enzymes that produce metabolites used as reactants by the corresponding encoded enzyme.
To assess the performance of the NTPGE approach, we used it for the discovery of essential genes in the bacterium
Escherichia coli, a model organism in which most genes have already been characterized experimentally as essential
or non-essential.



II. CONSTRUCTION OF THE IMN OF E. COLI 



As NTPGE relies on topological features of a molecular network, the first step was to construct the Escherichia coli
integrated molecular network (IMN) comprising protein physical, metabolic and transcriptional regulation interactions.
For this purpose, we used the MONET (MOlecular NETwork) ontology, a tool developed by our group that facilitates
the construction of IMNs of organisms via integration of information from metabolic pathways, protein-protein
interaction networks and transcriptional regulation interactions through a model able to minimize data redundancy and
inconsistency. As previously described, two genes of a given organism, g1 and g2, coding for proteins p1 and p2,
are linked if:

• p1 and p2 interact physically,

• g1 regulates the transcription of gene g2,

• or one metabolite generated by a reaction catalyzed by p1 is consumed in a reaction catalyzed by p2 (we may
exclude from this analysis the most used compounds such as ATP, NAD, H2O, etc.).

The data sources present in the MONET ontology used for the construction of the E. coli IMN were KEGG (Kyoto
Encyclopedia of Genes and Genomes) [11] for metabolic interactions, RegulonDB for transcriptional regulation
interactions, and Butland et al. for protein physical interactions.

Using MONET, we constructed two directed IMNs of E. coli, Ga and Gp. Ga contained all possible interactions
among genes, with 1,998 genes and 51,642 interactions. Gp was similar to Ga, except that the connections through
the ten most frequently used compounds in metabolism were deleted, producing a network with 1,987 genes and
21,338 interactions. Connections via these common compounds are unlikely to be important for the determination
of gene essentiality because of their promiscuity.
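
The linking rules and the two network variants can be sketched in a few lines of Python. This is only an illustration, not the MONET implementation: the function name `build_imn`, the input formats, the choice to store physical interactions as edges in both directions, and the ranking of compounds by the number of genes that produce or consume them are all assumptions made for the example.

```python
from collections import Counter

def build_imn(physical, regulation, produces, consumes, exclude_top=0):
    """Return a set of directed, typed gene-gene edges (source, target, kind).

    physical:   iterable of (gene, gene) pairs whose proteins interact
    regulation: iterable of (regulator_gene, target_gene) pairs
    produces:   dict gene -> set of metabolites produced by its enzyme
    consumes:   dict gene -> set of metabolites consumed by its enzyme
    exclude_top: number of most frequently used compounds to ignore
    """
    edges = set()
    # Physical interactions have no direction; store both orientations (assumption).
    for g1, g2 in physical:
        edges.add((g1, g2, "physical"))
        edges.add((g2, g1, "physical"))
    # Regulation runs from the transcription factor gene to its target.
    for regulator, target in regulation:
        edges.add((regulator, target, "regulation"))
    # Rank compounds by how many genes produce or consume them (assumption).
    usage = Counter()
    for metabolites in produces.values():
        usage.update(metabolites)
    for metabolites in consumes.values():
        usage.update(metabolites)
    excluded = {m for m, _ in usage.most_common(exclude_top)}
    # Index consumers by metabolite, then link each producer to each consumer.
    consumers = {}
    for gene, metabolites in consumes.items():
        for m in metabolites:
            consumers.setdefault(m, set()).add(gene)
    for g1, metabolites in produces.items():
        for m in metabolites:
            if m in excluded:
                continue
            for g2 in consumers.get(m, ()):
                if g1 != g2:  # self-links are skipped (assumption)
                    edges.add((g1, g2, "metabolic"))
    return edges
```

Under these assumptions, calling the function with `exclude_top=0` would play the role of Ga and `exclude_top=10` that of Gp.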



III. BRIEF ANALYSIS OF THE ESCHERICHIA COLI IMNS 



Before using the Escherichia coli IMNs for the validation of the NTPGE approach, we present here a brief analysis
of the most common network measures, i.e. degree distribution and clustering coefficient, of these IMNs. The degree
distribution, $P(k)$, gives the probability that a selected node has exactly $k$ links. $P(k)$ is obtained by counting the
number of nodes $N(k)$ with $k = 1, 2, \ldots$ links and dividing by the total number of nodes $N$. The clustering coefficient,
$C_i$, gives the density of triangles we can construct in the network having the node $i$ as a vertex. The clustering
coefficient is defined as:

$$C_i = \frac{2 n_i}{k_i (k_i - 1)}$$

where $n_i$ is the number of links connecting the $k_i$ neighbors of the node $i$. The average clustering coefficient $C$ is the
clustering coefficient for the whole network and characterizes the overall tendency of nodes to form clusters or groups.
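
As an illustration of these two measures, the following Python sketch computes the degree distribution and the clustering coefficient of a node. It assumes an undirected view of the network stored as a dictionary mapping each node to the set of its neighbors; the function names and this representation are choices made for the example, not part of the original analysis.

```python
from collections import Counter

def degree_distribution(adj):
    """P(k) = N(k) / N for an adjacency dict {node: set_of_neighbors}."""
    n = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def clustering_coefficient(adj, i):
    """C_i = 2 n_i / (k_i (k_i - 1)); defined as 0 when k_i < 2."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Each link among neighbors is seen from both endpoints, hence the // 2.
    n_i = sum(1 for u in neighbors for v in adj[u] if v in neighbors) // 2
    return 2 * n_i / (k * (k - 1))

def average_clustering(adj):
    """Mean of C_i over all nodes of the network."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)
```

For a triangle a-b-c with one extra node d attached to c, node a has C = 1, node c has C = 1/3 and node d has C = 0.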

In Figure 1 we show the histogram of the degree distribution for Ga and Gp. The curves are well approximated by a
power law function, $P(k) = A k^{-\gamma}$, for both IMNs, suggesting that Ga and Gp are scale-free networks.

FIG. 1: Histogram of the degree distribution for Ga and Gp used in this work. Both Ga (solid line) and Gp (dashed line) are
well described by a power law function $P(k) = A k^{-\gamma}$ that characterizes them as scale-free networks.



We also analyzed the dependence of the average clustering coefficient, C, on the connectivity k, defined as C(k).
For a traditional scale-free network, we expect C(k) not to depend on k, while for hierarchical networks we expect
C(k) to scale as a power of k. Figure 2 shows C(k) for Ga and Gp. These results point to a C(k) not dependent on k for Ga and a
C(k) dependent on k for Gp, thus indicating that Ga is a non-hierarchical IMN and Gp is a hierarchical IMN. This
shift from a non-hierarchical topology for Ga to a hierarchical topology for Gp seems to be caused by the deletion of
the connections through the ten most frequently used compounds in metabolism during the construction of Gp. Such
compounds induce a strongly connected IMN due to their promiscuity.
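
One way to test for such a dependence is to average the clustering coefficient over all nodes of equal degree and fit a straight line to the result. The sketch below does this with an ordinary least-squares fit on logarithmic axes; the use of a log-log fit and the function names are assumptions made for illustration, since the fitting procedure is not specified in the text.

```python
import math
from collections import defaultdict

def average_clustering_by_degree(degrees, clustering):
    """C(k): mean clustering coefficient over all nodes with degree k.

    degrees:    dict node -> degree k
    clustering: dict node -> clustering coefficient C_i
    """
    groups = defaultdict(list)
    for node, k in degrees.items():
        groups[k].append(clustering[node])
    return {k: sum(values) / len(values) for k, values in groups.items()}

def loglog_slope(ck):
    """Least-squares slope of log C(k) against log k.

    Points with k <= 0 or C(k) <= 0 cannot be placed on log axes and are skipped.
    """
    points = [(math.log(k), math.log(c)) for k, c in ck.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A slope whose confidence interval contains zero would indicate no dependence of C(k) on k, whereas a slope clearly different from zero would indicate a dependence.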




FIG. 2: The dependence of the average clustering coefficient C on the connectivity k. The best-fit regression line for
Ga (solid line) has a regression slope of -0.03 with a confidence interval of [-0.08, 0.01], while the best-fit regression line
for Gp (dashed line) has a regression slope of 0.28 with a confidence interval of [0.22, 0.33]. The results show that Ga is a
non-hierarchical scale-free network, whereas Gp is a hierarchical scale-free network.

IV. DESCRIPTION OF THE NTPGE APPROACH

The NTPGE approach was performed using the WEKA (Waikato Environment for Knowledge Analysis) system [16].
WEKA is a collection of machine learning algorithms for data mining tasks. It also provides means for data
pre-processing, classification, regression, clustering, association rules, and visualization [16]. Among these algorithms, we
used J48 [16], which is WEKA's implementation of the well-known C4.5 algorithm [17], which uses a greedy technique to
induce decision trees for classification. A decision-tree model is built by analyzing training data and is then used
to classify unseen data.
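
To make the greedy step concrete: at each node, C4.5 ranks candidate splits by the gain ratio, i.e. the information gain of the split divided by the entropy of the split itself. The Python sketch below evaluates this criterion for a binary threshold on one numeric attribute. It is a simplified illustration and not the J48 code: pruning, minimum leaf sizes and the other refinements of C4.5 are omitted, and the function names are chosen for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value <= threshold' versus 'value > threshold'."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    gain = entropy(labels) - w_left * entropy(left) - w_right * entropy(right)
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return gain / split_info

def best_threshold(values, labels):
    """Greedy choice: the observed value with the highest gain ratio (None if no split exists)."""
    candidates = sorted(set(values))[:-1]
    return max(candidates, key=lambda t: gain_ratio(values, labels, t), default=None)
```

Applied recursively to the subsets on each side of the chosen threshold, this selection rule yields a decision tree; the recursion and the stopping rules are left out of the sketch.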

We trained the J48 algorithm on four different training configurations (t1, t2, t3 and t4). In all configurations, the
training data was a set of known essential and non-essential genes of Escherichia coli taken from the PEC database
(Profiling of Escherichia coli chromosome, http://www.shigen.nig.ac.jp/ecoli/pec/). The PEC database
compiles experimental information on Escherichia coli strains from research reports and deletion mutation studies
prior to 1998, including gene essentiality for cell growth. Based on these reports about gene essentiality for cell growth,
the E. coli genes are classified as essential, non-essential or unknown. In all
training configurations, for a given gene, the learning attributes used were as follows:

• number of physical interactions for the corresponding encoded protein;

• number of target genes transcriptionally regulated by the corresponding encoded transcription factor
(regulation_out);

• number of transcription factors that regulate it (regulation_in);

• number of enzymes that use metabolites produced by the corresponding encoded enzyme as reactants
(metabolism_out);

• number of enzymes that produce metabolites used as reactants by the corresponding encoded enzyme
(metabolism_in).
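
Given a network stored as typed, directed edges, these five attributes reduce to counting neighbors by interaction type and direction. The sketch below shows one way to do so in Python; the edge format `(source, target, kind)`, the kind labels and the attribute name `physical` for the first attribute are assumptions made for the example.

```python
from collections import defaultdict

ATTRIBUTE_NAMES = ("physical", "regulation_out", "regulation_in",
                   "metabolism_out", "metabolism_in")

def topology_attributes(edges):
    """Return {gene: {attribute: count}} from (source, target, kind) edges.

    kind is one of "physical", "regulation" or "metabolic" (labels assumed here).
    """
    neighbors = defaultdict(lambda: defaultdict(set))
    for source, target, kind in edges:
        if kind == "physical":
            # Physical interactions are counted without direction (assumption).
            neighbors[source]["physical"].add(target)
            neighbors[target]["physical"].add(source)
        elif kind == "regulation":
            neighbors[source]["regulation_out"].add(target)
            neighbors[target]["regulation_in"].add(source)
        elif kind == "metabolic":
            # source produces a metabolite that target uses as a reactant.
            neighbors[source]["metabolism_out"].add(target)
            neighbors[target]["metabolism_in"].add(source)
    return {gene: {name: len(by_kind[name]) for name in ATTRIBUTE_NAMES}
            for gene, by_kind in neighbors.items()}
```

Sets are used so that a neighbor reached through several metabolites or duplicated records is counted once.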

In t1 and t2, the above mentioned attributes were extracted from Ga, whereas these same attributes were extracted
from Gp in t3 and t4. Moreover, the attribute damage, which was not originally present in Ga and Gp, was included
in t2 and t4. The damage d is defined as the number of metabolites whose production is prevented by the deletion
of the enzyme. For a given enzyme, its damage d has been shown to be strongly correlated with its essentiality.

The J48 algorithm was trained with the parameters presented in Table I. As data imbalance is known to be
one of the causes that degrade the performance of machine learning algorithms [19], and the number of non-essential
genes is much larger than the number of essential genes, we replicated the data related to the essential genes
in order to correct this imbalance.
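
Replication of the minority class can be sketched as follows. The replication factor is not stated in the text; the totals reported in Table II (2,090 essential instances against 209 actual essential genes) are consistent with a factor of ten, which is used below as an inferred value. The function name is chosen for this example.

```python
def replicate_minority(instances, labels, minority, factor):
    """Return copies of instances and labels with each minority-class instance repeated `factor` times."""
    out_instances, out_labels = [], []
    for instance, label in zip(instances, labels):
        copies = factor if label == minority else 1
        out_instances.extend([instance] * copies)
        out_labels.extend([label] * copies)
    return out_instances, out_labels

# Example with the class sizes reported for Ga: 209 essential, 1,789 non-essential.
genes = list(range(209 + 1789))
classes = ["E"] * 209 + ["N"] * 1789
balanced_genes, balanced_classes = replicate_minority(genes, classes, "E", 10)
```

With these sizes the balanced set holds 2,090 essential and 1,789 non-essential instances, 3,879 in total.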



V. PERFORMANCE OF THE NTPGE APPROACH AND RELATED DISCUSSION 

The performance of the NTPGE approach was evaluated by testing the classifiers created by the J48 algorithm,
as described above, on the training data itself. The selection of the best training configuration to be considered as
default by the NTPGE approach was performed based on the F-measure of the corresponding generated classifier.
The F-measure is the harmonic mean of precision and recall and is defined as:

$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

Precision (the percentage of instances assigned to a class that actually belong to it) and recall (the percentage of
instances of a class that were classified as such) were calculated from the confusion matrices of the classifiers obtained
from the training configurations t1, t2, t3 and t4 (Table II) and are shown in Table III. Table III also shows the F-measure as well
as the features of the training configurations, such as the number of instances (genes plus metabolites) and presence or
absence of the learning attribute damage d on training.
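
As a worked check of these definitions, the short Python sketch below recomputes the measures for t1 from its confusion matrix in Table II (actual non-essential: 1,392 predicted non-essential and 397 predicted essential; actual essential: 310 predicted non-essential and 1,780 predicted essential). The function name is chosen for this example.

```python
def class_metrics(true_pos, false_neg, false_pos):
    """Return (precision, recall, F-measure) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Confusion matrix of t1: rows are actual classes, columns are predicted classes.
n_as_n, n_as_e = 1392, 397   # actual non-essential
e_as_n, e_as_e = 310, 1780   # actual essential (after replication)

essential = class_metrics(true_pos=e_as_e, false_neg=e_as_n, false_pos=n_as_e)
non_essential = class_metrics(true_pos=n_as_n, false_neg=n_as_e, false_pos=e_as_n)
accuracy = (n_as_n + e_as_e) / (n_as_n + n_as_e + e_as_n + e_as_e)

for name, (p, r, f) in (("E", essential), ("N", non_essential)):
    print(f"{name}: precision={100 * p:.1f}% recall={100 * r:.1f}% F={100 * f:.1f}%")
print(f"correctly predicted: {100 * accuracy:.1f}%")
```

The rounded values agree with the t1 column of Table III: an F-measure of 83.4% for essential genes and 79.7% for non-essential genes, with 81.8% of genes correctly predicted.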

According to Table III, the best training configuration was t1 (all genes and metabolites, without the attribute damage).
Its corresponding generated classifier had an F-measure of 83.4% for essential genes and 79.7% for non-essential genes.
In fact, all generated classifiers yielded similar results, suggesting that neither the presence or absence of the ten most used
compounds in metabolism nor the presence or absence of the attribute damage d affected the classification of
genes as essential or non-essential by the NTPGE approach. Therefore, any training configuration could be selected
as default by NTPGE.

TABLE I: Parameters used to run the J48 algorithm on training data.

Parameter             Value
binarySplit           False
confidenceFactor      0.25
debug                 False
minNumObj             100
numFolds              3
reduceErrorPruning    False
saveInstanceData      False
seed                  1
subtreeRaising        True
unpruned              False
useLaplace            False


TABLE II: Confusion matrices of the classifiers generated from t1, t2, t3 and t4.

t1                          Predicted
Actual              Non-essential    Essential
Non-essential               1,392          397
Essential (a)                 310        1,780

t2                          Predicted
Actual              Non-essential    Essential
Non-essential               1,384          405
Essential (a)                 313        1,777

t3                          Predicted
Actual              Non-essential    Essential
Non-essential               1,346          432
Essential (a)                 298        1,792

t4                          Predicted
Actual              Non-essential    Essential
Non-essential               1,348          430
Essential (a)                 300        1,790

(a) The essential genes were replicated to avoid data imbalance. The actual number of essential genes is 209.

TABLE III: Features of the training configurations and performance measures of their corresponding generated classifiers.

Features and performance measures        Training configurations
                                       t1       t2       t3       t4
Number of genes (a)                 3,879    3,879    3,868    3,868
Damage d                               no      yes       no      yes
Correctly predicted genes (%)        81.8     81.5     81.1     81.1
Incorrectly predicted genes (%)      18.2     18.5     18.9     18.9
F-measure (N) (%)                    79.7     79.4     78.7     78.7
F-measure (E) (%)                    83.4     83.2     83.1     83.1
Recall (N) (%)                       77.8     77.4     75.7     75.8
Recall (E) (%)                       85.2     85.0     85.7     85.6
Precision (N) (%)                    81.8     81.6     81.9     81.8
Precision (E) (%)                    81.8     81.4     80.6     80.6

(a) The essential genes were replicated to avoid data imbalance; the number of non-essential genes remained unchanged.
The actual numbers are 209 essential and 1,789 non-essential genes for Ga, and 209 essential and 1,778 non-essential
genes for Gp.




FIG. 3: Decision tree generated from t1 with an F-measure of 83.4% for essential genes (E) and 79.7% for non-essential genes
(N). The (x/y) inside rectangles denotes the number of correctly classified genes (x) and the number of incorrectly classified
genes (y).

Figure 3 shows the set of rules of the decision tree generated from t1. The top node of the tree corresponds
to the attribute protein physical interaction. This means that the classification tree algorithm selected the number of
protein physical interactions as the main factor defining essentiality in E. coli. In fact, the degree of a protein
has been documented in the literature as being indicative of essentiality in various organisms [20], [21]. In our
approach, a combination of an intermediate number of protein physical interactions with at least one interaction of the
type metabolism_in, i.e. number of enzymes that produce metabolites used as reactants by the corresponding encoded
enzyme, was also indicative of essentiality. Transcriptional regulation interactions seem not to be a good predictor
of gene essentiality, since genes with at least one interaction of the type regulation_out, i.e. number of target
genes transcriptionally regulated by the corresponding encoded transcription factor, were classified as non-essential.
Moreover, the attribute regulation_in, i.e. the number of transcription factors that regulate a given gene, was not
even included in the decision tree. These results regarding gene essentiality and transcriptional regulation are not
surprising, since transcription factors are usually not essential under the conditions in which the knockout experiments
for determining gene essentiality are performed (PEC database, http://www.shigen.nig.ac.jp/ecoli/pec/).



VI. CONCLUDING REMARKS 



We proposed here a novel machine learning-based computational approach, called NTPGE (Network Topology-based
Prediction of Gene Essentiality), that relies on network topology features of a gene to estimate its essentiality.
Distinct from previous network topology-based gene essentiality predictors, NTPGE employs multiple topological
network features of a given gene to estimate its essentiality, namely number of physical interactions for the corresponding encoded
protein, number of target genes transcriptionally regulated by the corresponding encoded transcription factor, number
of transcription factors that regulate it, number of enzymes that use metabolites produced by the corresponding
encoded enzyme as reactants and number of enzymes that produce metabolites used as reactants by the corresponding
encoded enzyme.

We verified the performance of NTPGE by applying it to the discovery of essential genes in the bacterium
Escherichia coli, a model organism in which most genes have already been characterized experimentally as essential or
non-essential. Among the interactions considered as learning attributes, NTPGE relied mostly on protein physical
and metabolic interactions for gene essentiality prediction. In addition, neither the presence or absence of the ten most used
compounds in metabolism nor the presence or absence of the attribute damage d appeared to influence the
classification of genes as essential or non-essential by NTPGE, since the F-measure values of all
generated decision trees were similar. The best classifier was generated from t1 (all genes and metabolites,
without the attribute damage), with an F-measure of 83.4% for essential genes and 79.7% for non-essential genes.

In conclusion, NTPGE seems to be a reliable method of gene essentiality discovery that may be applied to the
gene set of other organisms. However, NTPGE is limited to organisms whose corresponding IMN has already been
constructed. The construction of the IMN of a given organism involves the gathering of experimentally determined
data that are not always available, particularly for a newly sequenced organism. To overcome this limitation, a future
development would be the integration of NTPGE with sequence-based methods of IMN construction, thus creating
a purely in silico network topology information-based methodology of gene essentiality discovery.